EDS 222: Assignment 01

Gabrielle Benoit

Assigned: 09/28, due 10/09 5pm

(The case study in this exercise is based on reality, but does not include actual observational data.)

Air Pollution in Lahore, Pakistan

In this exercise we will look at a case study concerning air quality in South Asia. The World Health Organization estimates that air pollution kills seven million people per year due to its effects on the cardiovascular and respiratory systems. South Asia is home to 37 of the 40 most polluted cities in the world, and Pakistan ranked as the country with the second-worst air pollution in the world in 2020 (IQAIR, 2020). In 2019, Lahore, Pakistan was the 12th most polluted city in the world, exposing a population of 11.1 million people to increased mortality and morbidity risks.

In this exercise, you are given two datasets¹ from Lahore, Pakistan and are asked to compare the two different data collection strategies from this city. These data are:

¹ All data for EDS 222 will be stored on the Taylor server, in the shared /courses/EDS222/data/ directory. Please see material from EDS 214 on how to access and retrieve data from Taylor. These data are small; all compute can be handled locally. Thanks to Bren PhD student Fatiq Nadeem for assembling these data!

In answering the following questions, please consider the lecture content from class on sampling strategies, as well as the material in Chapter 2 of Introduction to Modern Statistics. Include in your submission an .Rmd file and a compiled .html file, each containing complete answers to all questions (as well as all your code in the .Rmd).

Question 1:

Load the data from each source and label it as crowdsourced and govt accordingly. For example:

  1. These dataframes have one row per pollution observation. How many pollution records are in each dataset?
#> [1] 5488    4
#> [1] 1960    4
  2. Each monitor is located at a unique latitude and longitude location. How many unique monitors are in each dataset?²

² Hint: group_by(longitude, latitude) and cur_group_id() in dplyr will help in creating a unique identifier for each (longitude, latitude) pair.

cur_group_id() gives a unique numeric identifier for the current group.

#> # A tibble: 5,488 × 4
#> # Groups:   longitude, latitude [14]
#>   date          PM longitude latitude
#>   <date>     <dbl>     <dbl>    <dbl>
#> 1 2018-11-04    71      74.4     31.6
#> 2 2018-11-05    51      74.4     31.6
#> 3 2018-11-06    63      74.4     31.6
#> 4 2018-11-07    89      74.4     31.6
#> 5 2018-11-08    29      74.4     31.6
#> 6 2018-11-09    43      74.4     31.6
#> # ℹ 5,482 more rows
#> # A tibble: 5,488 × 5
#>   date          PM longitude latitude    id
#>   <date>     <dbl>     <dbl>    <dbl> <int>
#> 1 2018-11-04    71      74.4     31.6     1
#> 2 2018-11-05    51      74.4     31.6     1
#> 3 2018-11-06    63      74.4     31.6     1
#> 4 2018-11-07    89      74.4     31.6     1
#> 5 2018-11-08    29      74.4     31.6     1
#> 6 2018-11-09    43      74.4     31.6     1
#> # ℹ 5,482 more rows
#> # A tibble: 1,960 × 4
#> # Groups:   longitude, latitude [5]
#>   date          PM latitude longitude
#>   <date>     <dbl>    <dbl>     <dbl>
#> 1 2018-11-04    28     31.6      74.3
#> 2 2018-11-05    34     31.6      74.3
#> 3 2018-11-06    44     31.6      74.3
#> 4 2018-11-07    60     31.6      74.3
#> 5 2018-11-08    25     31.6      74.3
#> 6 2018-11-09    60     31.6      74.3
#> # ℹ 1,954 more rows
#> # A tibble: 1,960 × 5
#>   date          PM latitude longitude    id
#>   <date>     <dbl>    <dbl>     <dbl> <int>
#> 1 2018-11-04    28     31.6      74.3     1
#> 2 2018-11-05    34     31.6      74.3     1
#> 3 2018-11-06    44     31.6      74.3     1
#> 4 2018-11-07    60     31.6      74.3     1
#> 5 2018-11-08    25     31.6      74.3     1
#> 6 2018-11-09    60     31.6      74.3     1
#> # ℹ 1,954 more rows
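The two counts asked for in Question 1 can be sketched on a small stand-in data frame. This is only an illustration: the `toy` tibble and its values are made up, and only the column names (date, PM, longitude, latitude) follow the output shown above.

```r
library(dplyr)

# Toy stand-in for one dataset; the values are invented for illustration only
toy <- tibble(
  date      = as.Date("2018-11-04") + c(0, 1, 2, 0, 1, 0),
  PM        = c(71, 51, 63, 40, 45, 90),
  longitude = c(74.40, 74.40, 74.40, 74.35, 74.35, 74.30),
  latitude  = c(31.60, 31.60, 31.60, 31.55, 31.55, 31.50)
)

# One row per pollution observation, so the record count is the row count
n_records <- nrow(toy)

# Give every (longitude, latitude) pair an integer id, then drop the grouping
toy_ids <- toy %>%
  group_by(longitude, latitude) %>%
  mutate(id = cur_group_id()) %>%
  ungroup()

# The number of unique monitors is the number of distinct ids
n_monitors <- n_distinct(toy_ids$id)
```

Applied to the real data, the same pipeline would be run once on crowdsourced and once on govt; the `Groups:` header in the printed tibbles above reports the same count of (longitude, latitude) pairs.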

Question 2:

The goal of pollution monitoring in Lahore is to measure the average pollution conditions across the city.

  1. What is the population in this setting? Please be precise.

The population is all individuals living in Lahore, which, according to the introduction to this assignment, is roughly 11.1 million people.

  2. What are the samples in this setting? Please be precise.

The samples are the 7,448 pollution records (5,488 crowdsourced + 1,960 government), which come from 14 crowdsourced monitors and 5 government monitors. According to the intro to this assignment, the crowdsourced data comes from people who chose to install a monitor and chose to upload their data for public access. The government data is likely biased toward cleaner-air locations in order to reduce the pressure from both domestic and international groups urging officials in Lahore, Pakistan to take steps to improve the air quality.

  3. These samples were not randomly collected from across locations in Lahore. Given the sampling approaches described above, discuss possible biases that may enter when we use these samples to construct estimates of population parameters.

The crowdsourced data is likely biased toward homes where the individuals are knowledgeable about the harms of air pollution and therefore motivated to contribute to citizen science efforts that will demonstrate, with data, the lived experience of air pollution. These individuals are likely well educated, housed in permanent structures, and civically engaged.

The government data is likely biased toward clean air; the air monitors are likely located (1) in wealthy homes or buildings that have advanced infrastructure or simply sit in less congested parts of Lahore, or (2) in areas whose geography lends itself to less air pollution (not valleys beside mountain ranges, for example).

Question 3:

  1. For both the government data and the crowd-sourced data, report the sample mean, sample minimum, and sample maximum value of PM 2.5 (measured in \(\mu g/m^3\)).
#> [1] 70.2
#> [1] 39.6
#> [1] 20
#> [1] 15
#> [1] 120
#> [1] 65

  2. Discuss any key differences that you see between these two samples.

The mean and maximum of PM 2.5 in the government sample are roughly half the values in the crowdsourced data (39.6 vs. 70.2 and 65 vs. 120), and the minimum is also lower (15 vs. 20). This demonstrates that the government sample used to assess population-level air pollution is NOT describing the full range of air pollution experiences in Lahore.

  3. Are the differences in mean pollution as expected, given what we know about the sampling strategies?

Yes, these differences are expected, but their extent was a surprise!
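The per-source summary statistics in Question 3 could be computed in one pass, as in this minimal sketch. The two toy tibbles and their PM values are assumptions made up for illustration; they are not the assignment data.

```r
library(dplyr)

# Toy stand-ins for the two datasets; PM values are invented for illustration
toy_crowdsourced <- tibble(PM = c(71, 51, 63, 89))
toy_govt         <- tibble(PM = c(28, 34, 44, 60))

# Stack the two samples, keeping a label for where each row came from,
# then compute the sample mean, minimum, and maximum per source
pm_summary <- bind_rows(
  crowdsourced = toy_crowdsourced,
  govt         = toy_govt,
  .id = "source"
) %>%
  group_by(source) %>%
  summarise(
    mean_PM = mean(PM),
    min_PM  = min(PM),
    max_PM  = max(PM)
  )
```

Keeping both sources in one table puts the statistics side by side, which makes the comparison in the discussion easier to read than six separate console values.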

Question 4:

Use the location of the air pollution stations for both of the sampling strategies to generate a map showing locations of each observation. Color the two samples with different colors to highlight how each sample obtains measurements from different parts of the city.³

³ Hint: longitude indicates location in the x-direction, while latitude indicates location in the y-direction. With ggplot2 this should be nothing fancy. We’ll do more spatial data in R later in the course.
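One possible way to draw such a map is a plain ggplot2 scatter plot with longitude on the x-axis and latitude on the y-axis. The sketch below uses made-up coordinates in two toy tibbles; the object names (`monitors`, `monitor_map`) are illustrative choices, not part of the assignment.

```r
library(dplyr)
library(ggplot2)

# Toy monitor locations; coordinates are invented for illustration only
toy_crowdsourced <- tibble(
  longitude = c(74.40, 74.35, 74.30),
  latitude  = c(31.60, 31.55, 31.50)
)
toy_govt <- tibble(
  longitude = c(74.33, 74.34),
  latitude  = c(31.58, 31.59)
)

# Label each row by sample and keep one row per monitor location
monitors <- bind_rows(
  crowdsourced = toy_crowdsourced,
  govt         = toy_govt,
  .id = "source"
) %>%
  distinct(source, longitude, latitude)

# Scatter plot of locations, colored by sample
monitor_map <- ggplot(monitors, aes(x = longitude, y = latitude, color = source)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(x = "Longitude", y = "Latitude", color = "Sample") +
  theme_minimal()
```

Printing `monitor_map` renders the plot. Reducing the data to distinct locations first avoids overplotting hundreds of identical points per monitor.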

Question 5:

The local newspaper in Pakistan, Dawn, claims that the government is misreporting the air pollution levels in Lahore. Do the locations of monitors in question 4, relative to crowd-sourced monitors, suggest anything about a possible political bias?

Yes, the monitor locations are consistent with the claim that the government is misreporting the air pollution levels in Lahore. As we can see in the previous question, the crowdsourced monitors are fairly spread out, whereas the government monitors are far fewer in number and clustered in the northeast part of Lahore.

Question 6:

Given the recent corruption in air quality reporting, the Prime Minister of Pakistan has hired an independent body of environmental data scientists to create an unbiased estimate of the mean PM 2.5 across Lahore using some combination of both government stations and crowd-sourced observations.

NASA’s satellite data indicates that the average PM across Lahore is 89.2 \(\mu g/m^3\). Since this is the most objective estimate of population-level PM 2.5 available, your goal is to match this mean as closely as possible by creating a new ground-level monitoring sample that draws on both the government and crowd-sourced samples.

Question 6.1:

First, generate a random sample of size \(n=1000\) air pollution records by (i) pooling observations across the government and the crowd-sourced data;⁴ and (ii) drawing observations at random from this pooled sample.

⁴ Hint: bind_rows() may be helpful.

Second, create a stratified random sample. Do so by (i) stratifying your pooled data-set into strata of 0.01 degrees of latitude, and (ii) randomly sampling 200 air pollution observations from each stratum.
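The two sampling schemes described above can be sketched as follows. The data here are simulated stand-ins, not the assignment data, and binning latitude with `floor(latitude * 100) / 100` is one assumed way to form 0.01-degree strata (rounding to two decimals would be another).

```r
library(dplyr)

set.seed(222)  # make the random draws reproducible

# Simulated stand-ins for the two datasets (illustration only)
toy_crowdsourced <- tibble(
  PM       = runif(3000, 20, 120),
  latitude = runif(3000, 31.50, 31.54)
)
toy_govt <- tibble(
  PM       = runif(1000, 15, 65),
  latitude = runif(1000, 31.53, 31.54)
)

# (i) Pool the two samples, remembering each record's source
pooled <- bind_rows(
  crowdsourced = toy_crowdsourced,
  govt         = toy_govt,
  .id = "source"
)

# (ii) Simple random sample of 1000 records from the pooled data
srs <- pooled %>%
  slice_sample(n = 1000)

# Stratified sample: 0.01-degree latitude bins, up to 200 records per bin
stratified <- pooled %>%
  mutate(stratum = floor(latitude * 100) / 100) %>%
  group_by(stratum) %>%
  slice_sample(n = 200) %>%
  ungroup()

# Estimated mean PM under each strategy
mean_srs        <- mean(srs$PM)
mean_stratified <- mean(stratified$PM)
```

Without `replace = TRUE`, `slice_sample()` returns every record from a stratum that holds fewer than 200, so strata can end up with unequal sizes; it is worth checking the per-stratum counts before interpreting the stratified mean.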

Question 6.2:

Compare estimated means of PM 2.5 for each sampling strategy to the NASA estimate of 89.2 \(\mu g/m^3\). Which sample seems to match the satellite data best? What would you recommend the Prime Minister do? Does your proposed sampling strategy rely more on government or on crowd-sourced data? Why might that be the case?

The proposed sampling strategy is MUCH closer to the NASA estimate, so it must rely more heavily on the crowdsourced data. This might be the case because (1) there were many more individual observations in the crowdsourced data, and (2) there was more variety in the locations of the crowdsourced data, so they were represented in more strata. I would recommend that the Prime Minister use this strategy and require third-party data analysis for environmental topics, such as air pollution, in order to reduce misrepresented data that paints a more favorable, yet inaccurate, picture of reality.
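A claim like the one above can be checked directly by measuring a sample's gap to the satellite mean and its mix of sources. The sketch below is only an illustration: `toy_sample` and its values are made up, and 89.2 is the NASA figure quoted in the assignment.

```r
library(dplyr)

nasa_mean <- 89.2  # satellite estimate quoted in the assignment

# Toy sample with a source label; values are invented for illustration only
toy_sample <- tibble(
  source = c("crowdsourced", "crowdsourced", "crowdsourced", "govt"),
  PM     = c(90, 100, 80, 50)
)

# How far is the sample mean from the satellite estimate?
sample_mean <- mean(toy_sample$PM)
gap         <- abs(sample_mean - nasa_mean)

# What share of the sampled records comes from each source?
source_share <- toy_sample %>%
  count(source) %>%
  mutate(share = n / sum(n))
```

Running the same two steps on both the random and the stratified sample would show which one sits closer to 89.2 and how heavily each leans on crowdsourced records.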

#> [1] 15
#> [1] 120
#> [1] 64.1
#> [1] 70.2
#> [1] 39.6
#> [1] 20
#> [1] 15
#> [1] 120
#> [1] 65